Penguins in Python...

...and the Grammar of Graphics and some pandas missing data gotchas

This notebook explores the dataset discussed in the video "Surprising values. Missing values. Giant penguins" and, in doing so, gives you a brief taste of the Grammar of Graphics concept. It also aims to raise your awareness of the way in which the pandas library currently (mis-)treats missing data so that you can be on the lookout for this in practice.

How to use this code walk-through

While this notebook can be run from beginning to end with no input from you, the point is for you to step through it cell-by-cell to get a sense of the concepts that it implements. Feel free to experiment by changing aspects of the code and observing the effects.

To go deeper, we recommend Jeroen Janssens' Plotnine: Grammar of Graphics for Python tutorial.

The Grammar of Graphics concept

The word grammar means the rules of a language, including how its different elements are combined to convey meaning. Having widely understood rules and conventions helps us to communicate with each other.

Leland Wilkinson's Grammar of Graphics was published in 1999 and proposed rules for how we define and combine different graphical elements to reveal insight from data. This notion was picked up by Hadley Wickham who proposed and implemented a layered grammar of graphics in R in what is now an extremely powerful and popular package called ggplot2. Hassan Kibirige has developed a python implementation of ggplot called plotnine and writes:

The grammar allows users to compose plots by explicitly mapping data to the visual objects that make up the plot. Plotting with a grammar is powerful, it makes custom (and otherwise complex) plots easy to think about and then create, while the simple plots remain simple.

This notebook aims to give you a taste of ggplot and plotnine so that you can be aware of how useful it is to have a layered grammar of graphics. As Brian Cusack explains, three layers are essential:

  • Data: The actual variables to be plotted.
  • Aesthetics: The scales onto which we will map our data.
  • Geometries: Shapes used to represent our data.

In ggplot, these three layers are defined as follows:

  1. Data: Specified by the first argument to ggplot()
  2. Aesthetics: Specified in the aes(…) function
  3. Geometries: Created with a geom(…) function

Your first ever ggplot

Let's import the libraries we need to demonstrate the grammar of graphics...

In [1]:
from plotnine import *
import pandas as pd
import numpy  as np

...hide some informative warnings that are not critical to this exercise...

In [2]:
import warnings
warnings.filterwarnings('ignore')

...get some data...

In [3]:
pngns = pd.read_csv("./data/pngns.csv")

...and make our first ggplot:

In [4]:
ggplot(
    pngns,                 # Using the pngns data
    aes(
        x='height',        # map height to the x scale
        y='species',       # map species to the y scale
        colour='species')  # map species also to the colour scale
    ) + geom_jitter()      # show each observation as a jittered point
Out[4]:
<ggplot: (-9223371846598519124)>

In the line of code above

  • The data is the pngns data frame
  • The (continuous) height variable is mapped to the x scale
  • The (categorical) species variable is mapped to the y scale
  • The (categorical) species variable is mapped to the color scale
  • Each observation is represented by a jittered point

With these basic elements in place, we can now

  • adjust the way that points are jittered (only vertically)
  • adjust the non-data elements of the plot (e.g., by hiding the legend)
  • add annotations (e.g., a title):
In [5]:
ggplot(pngns, aes(x='height', y='species', colour='species')) + \
    geom_jitter(width=0, alpha=0.5) + \
    theme(legend_position = "none") + \
    labs(title='Original pngns data')
Out[5]:
<ggplot: (-9223371846599242812)>

This is essentially the first plot we see in the video.
(There are some additional annotations in the video version, i.e., a subtitle and a caption, which plotnine hasn't implemented yet.)
Just for comparison, here is the R code used to produce the plot in the video:

ggplot(pngns, aes(x=height, y=species, colour=species)) + 
  geom_jitter(width=0, alpha=0.5) +
  theme(legend.position = "none") +
  labs(
    title="Original pngns data",
    subtitle="Natural scale",
    caption="Points jittered and grouped by species."
  )

...pretty similar. It's nice to think that the same layered grammar of graphics can be used in both Python and R.

To suppress the output and just show the figure, we wrap the entire ggplot call in parentheses, call the .draw() method and finish with a ; to prevent output to this notebook, like this:

In [6]:
(ggplot(pngns, aes(x='height', y='species', colour='species')) + \
    geom_jitter(width=0, alpha=0.5) + \
    theme(legend_position = "none") + \
    labs(title='Original pngns data')).draw();

Now, on with the story in the video:

Whoa! That’s a surprise. There are some really tall penguins out there! Wow! Amazing!

Your research collaborators are keen to get the data analysed so they can submit their findings to a prestigious journal, so you write a command to calculate the average heights observed for each species… Hit enter… And...

Big trouble with pandas and missing values

In [7]:
pngns[['species', 'height']].groupby(['species']).mean()
Out[7]:
height
species
Adelie 2983.908271
Chinstrap 4957.523967
Gentoo 1940.669722

Here's where R and Python (specifically pandas) differ quite significantly in their treatment of missing values.

Recall that in the video, the output of the R command to get the means of the groups was:

> pngns %>% 
+   group_by(species) %>%
+   summarise(
+     average.height=mean(height)
+   )
# A tibble: 3 x 2
  species   average.height
  <fct>              <dbl>
1 Adelie                NA
2 Chinstrap             NA
3 Gentoo                NA

If you compare this with the pandas version above, you will see that

  • pandas silently removes missing values from its calculation of the group means.
  • R noisily reminds us that our data contains missing values by saying NA NA NA

This is a major shortcoming with pandas, especially when it comes to working with real-world data where missing values are commonplace.

pandas uses the IEEE-754 standard floating-point value NaN ("Not a Number") to indicate missing values. This is not what NaN was meant for in the standard: it was intended for computations (e.g., 0/0) whose results are undefined, not to indicate the absence of a value. In contrast, R has a dedicated NA value that indicates missingness for floating-point, integer, string and boolean variables.
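For a concrete feel of the difference, here is a small sketch. (The second half assumes pandas ≥ 1.0, which introduced the experimental pd.NA marker and nullable dtypes; these are not used by default when reading data.)

```python
import numpy as np
import pandas as pd

# NaN is IEEE-754's marker for an undefined result, e.g. 0/0:
with np.errstate(invalid='ignore'):
    print(np.float64(0.0) / np.float64(0.0))  # nan

# NaN is not equal to anything, not even itself:
print(np.nan == np.nan)                       # False

# pandas' newer nullable dtypes provide a dedicated pd.NA marker,
# closer in spirit to R's NA:
s = pd.Series([1, None, 3], dtype="Int64")
print(s.tolist())                             # [1, <NA>, 3] -- stays an integer column
print(pd.NA + 1)                              # <NA>: NA propagates through arithmetic, like R's NA
```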

Here's an example to show how pandas hides the reality of our data from us. Let's create a data frame with three groups of values

  • A, which has no missing values
  • B, which is entirely missing values
  • C, which has one missing value
In [8]:
df = pd.DataFrame({
    'group':  [   "A",    "A",    "B",    "B",    "C",    "C"],
    'values': [     1,      0, np.nan, np.nan,      1, np.nan]
})
df
Out[8]:
group values
0 A 1.0
1 A 0.0
2 B NaN
3 B NaN
4 C 1.0
5 C NaN

If we calculate some descriptive statistics of the values within each of the three groups, we get

In [9]:
df.groupby('group')['values'].agg(['count', 'size', 'sum', 'mean', 'min', 'max'])
Out[9]:
count size sum mean min max
group
A 2 2 1.0 0.5 0.0 1.0
B 0 2 0.0 NaN NaN NaN
C 1 2 1.0 1.0 1.0 1.0

Note especially that the sum of values in group B is wrong.

If we add two NaN values by hand, the result is (correctly) undefined:

In [10]:
np.nan + np.nan
Out[10]:
nan
In [11]:
pd.DataFrame([np.nan, np.nan]).sum().item()
Out[11]:
0.0

As remarked by user the-moose-machine on GitHub in relation to sum():

All NaN values are picked up as 0. This is useless when manipulating data for academic research.

We can write some functions that do not skip over any missing values:

In [12]:
def sumNA(series):
    return series.sum(skipna=False)

def meanNA(series):
    return series.sum(skipna=False) / series.size

def minNA(series):
    return series.min(skipna=False)

def maxNA(series):
    return series.max(skipna=False)
In [13]:
df.groupby('group')['values'].agg(['count', 'size', 'sum', sumNA, 'mean', meanNA, 'min', minNA, 'max', maxNA])
Out[13]:
count size sum sumNA mean meanNA min minNA max maxNA
group
A 2 2 1.0 1.0 0.5 0.5 0.0 0.0 1.0 1.0
B 0 2 0.0 NaN NaN NaN NaN NaN NaN NaN
C 1 2 1.0 NaN 1.0 NaN 1.0 NaN 1.0 NaN
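To be fair, pandas does provide built-in escape hatches that make hand-rolled wrappers like sumNA() unnecessary in some cases: most reductions accept skipna=False, and sum()/prod() additionally accept a min_count parameter. A quick sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

# Default behaviour: NaNs are skipped, so an all-NaN sum collapses to 0.0
print(s.sum())              # 0.0

# min_count requires at least that many non-NA values, otherwise NaN:
print(s.sum(min_count=1))   # nan

# skipna=False propagates NaN straight through, matching R's default:
print(s.sum(skipna=False))  # nan
```

The same keywords also work in a groupby context, e.g. df.groupby('group')['values'].sum(min_count=1); the wrapper functions remain handy when you want several such statistics side by side in a single .agg() call.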

In summary, when the missing value NA is involved in computations in R, the result is also NA... whereas pandas silently treats NaN as zero... an extremely bad thing to do because it simultaneously masks a problem in the data and creates a misleading result.

Mark my words, there'll be a financial/space-flight/medical disaster as a result... and another thing...

...ah. Sorry. Back to the story.

Back to the story...

Using a slightly different command which handles missing data appropriately, you calculate the average heights of the species and...

In [14]:
pngns[['species', 'height']].groupby(['species']).agg([meanNA])
Out[14]:
height
meanNA
species
Adelie NaN
Chinstrap NaN
Gentoo NaN

Whoa! Another surprise! What’s that “NA” mean?
After some quick googling, you learn that NA stands for “not available”: a value is missing for some reason.
No problem! You heard once that you could replace any missing values with zero because they have zero effect on your analysis.
So you put in a command to change any height that is NA to zero, or otherwise leave it as is.

In [15]:
pngnsNA0 = pngns.fillna(0)
In [16]:
pngnsNA0[['species', 'height']].groupby(['species']).agg([meanNA]).round(0)
Out[16]:
height
meanNA
species
Adelie 2867.0
Chinstrap 4593.0
Gentoo 1894.0

Imagine we are reanalysing the data

In the next part of the story we put the data on a logarithmic scale to see the data better across different orders of magnitude. ggplot makes this easy to do by introducing another layer of graphics: the scale. To see the impact of this, try running the following command, then commenting out the call to scale_x_log10():

In [17]:
(ggplot(pngns, aes(x='height', y='species')) + \
  geom_jitter(aes(colour='species'), width=0, alpha=0.5) +\
  scale_x_log10(
    breaks=[0.1, 1, 10, 100, 1000, 10000, 100000], 
    labels=['0.1', '1', '10', '100', '1,000', '10,000', '100,000']) +\
  theme(legend_position = "none") +\
  labs(
    title="Original pngns data",
    subtitle="Logarithmic scale",
    caption="Points jittered and grouped by species."
    )).draw();

Next, we change the y-axis and colour scheme to show who observed the data. Note that this simply involves mapping y='researcher' and colour='researcher':

In [18]:
(ggplot(pngns, aes(x='height', y='researcher')) + \
  geom_jitter(aes(colour='researcher'), width=0, alpha=0.5) +\
  scale_x_log10(
    breaks=[0.1, 1, 10, 100, 1000, 10000, 100000], 
    labels=['0.1', '1', '10', '100', '1,000', '10,000', '100,000']) +\
  theme(legend_position = "none") +\
  labs(
    title="Original pngns data",
    subtitle="Logarithmic scale",
    caption="Points jittered and grouped by the researchers who collected the data."
    )).draw();

In R, we filter out Uli's extreme values like this:

filter(pngns, researcher=="Uli", height > 10000)

With pandas, the same can be achieved by:

In [19]:
pngns.loc[(pngns.researcher=="Uli") & (pngns.height > 10000)]
Out[19]:
species researcher date height bill_depth_mm bill_length_mm
0 Adelie Uli 2022-04-12 99999.0 18.7 39.1
49 Adelie Uli 2022-04-09 99999.0 21.2 42.3
108 Adelie Uli 2022-04-09 99999.0 17.0 38.1
145 Adelie Uli 2022-04-09 99999.0 18.7 39.0
191 Gentoo Uli 2022-04-12 99999.0 15.7 48.7
209 Gentoo Uli 2022-04-09 99999.0 15.0 45.5
277 Chinstrap Uli 2022-04-09 99999.0 19.5 50.0
308 Chinstrap Uli 2022-04-12 99999.0 16.7 42.5
319 Chinstrap Uli 2022-04-12 99999.0 17.0 45.5

Next, we replace any instances of 99999 with None and convert all measurements to the centimetre scale by dividing Stig’s millimetre measurements by 10 and multiplying Freya’s metre-scale measurements by 100. Then we replot the data:

In [20]:
pngns_fixed = pngns.copy()   # copy so the original data frame is left untouched
pngns_fixed.loc[(pngns_fixed.researcher=="Uli") & (pngns_fixed.height > 10000), 'height'] = None
pngns_fixed.loc[(pngns_fixed.researcher=="Stig"),  'height'] = pngns_fixed.loc[(pngns_fixed.researcher=="Stig"),  'height']/10
pngns_fixed.loc[(pngns_fixed.researcher=="Freya"), 'height'] = pngns_fixed.loc[(pngns_fixed.researcher=="Freya"), 'height']*100
In [21]:
(ggplot(pngns_fixed, aes(x='height', y='researcher')) + \
  geom_jitter(aes(colour='researcher'), width=0, alpha=0.5) +\
  expand_limits(x=0) +\
  theme(legend_position = "none") +\
  labs(
    title="Original pngns data",
    subtitle="Natural scale",
    caption="Points jittered and grouped by the researchers who collected the data."
    )).draw();
In [22]:
(ggplot(pngns_fixed, aes(x='height', y='species')) + \
  geom_jitter(aes(colour='species'), width=0, alpha=0.5) +\
  expand_limits(x=0) +\
  theme(legend_position = "none") +\
  labs(
    title="Original pngns data",
    subtitle="Natural scale",
    x="height (cm)",
    caption="Points jittered and grouped by species."
  )).draw();

What happens if we replace missing data with zeros?

As data scientists, our job is not to get rid of unexpected values. Our job is to understand how those values came to be and then take appropriate action...
In the original analysis, missing values were replaced by zero in the mistaken belief that this would somehow have zero effect.
Replacing missing values with actual values is called imputation.
And the imputed values that we choose can have a significant impact on the conclusions that we draw from our analysis.
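To see how much the choice of imputed value matters, here is a tiny sketch with made-up heights (hypothetical numbers, not the pngns data):

```python
import numpy as np
import pandas as pd

# Four hypothetical height measurements (cm), one missing
heights = pd.Series([50.0, 55.0, np.nan, 60.0])

print(heights.mean())                         # 55.0  (NaN dropped from the calculation)
print(heights.fillna(0).mean())               # 41.25 (NaN -> 0 drags the mean down)
print(heights.fillna(heights.mean()).mean())  # 55.0  (mean imputation preserves the mean)
```

Imputing zero pulls the estimate sharply downwards; mean imputation leaves the point estimate alone but silently shrinks the variance. Neither is a free lunch, which is why the choice of imputation method deserves explicit thought.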

Here we create summaries of the data with NA values of height removed...

In [23]:
pngns_fixed_heightNArm = \
    pngns_fixed[['species', 'height']].\
    groupby(['species']).\
        mean().round(1).reset_index()
# pngns_fixed_heightNArm

...and with NA values of height (incorrectly) replaced by 0:

In [24]:
pngns_fixedNA0 = pngns_fixed.fillna(0)
pngns_fixed_heightNA0 = \
    pngns_fixedNA0[['species', 'height']].\
    groupby(['species']).\
        mean().round(1).reset_index()
# pngns_fixed_heightNA0

...which we plot for comparison

In [25]:
(ggplot(pngns_fixedNA0, aes(y='species', colour='species')) + \
  geom_jitter(aes(x='height'), width=0, alpha=0.5, show_legend=False) + \
  geom_errorbarh(pngns_fixed_heightNA0, aes(xmin='height', xmax='height', y='species'), show_legend=False) + \
  annotate('label', x=pngns_fixed_heightNA0.height-5, y = pngns_fixed_heightNA0.species, label='NAs→0', ha='right') +\
  expand_limits(x=0) +\
  labs(title="Original pngns data with NA values set to 0",  x="height (cm)")).draw();

(ggplot(pngns_fixed, aes(y='species', colour='species')) + \
  geom_jitter(aes(x='height'), width=0, alpha=0.5, show_legend=False) + \
  geom_errorbarh(pngns_fixed_heightNArm, aes(xmin='height', xmax='height', y='species'), show_legend=False) + \
  annotate('label', x=pngns_fixed_heightNArm.height+5, y = pngns_fixed_heightNArm.species, label='NAs removed', ha='left') +\
  expand_limits(x=0) +\
  labs(title="Original pngns data with NA values removed",  x="height (cm)")).draw();

When do the NA values occur?

Here we reproduce in Python the plot of the number of NAs by date:

In [26]:
NAs_by_date = \
    pngns_fixed[['date', 'height']].\
    groupby(['date']).\
        agg({'height': lambda x: x.isnull().sum()}).\
    rename(columns={"height": "NAs"}).\
    reset_index()                                      # Need this to make sure the date column appears in the output
In [27]:
(ggplot(NAs_by_date, aes(x='date', y='NAs')) + \
  geom_col() + \
  labs(
    title="Original pngns data",
    subtitle="Number of NAs by date",
    y="Number of NAs"
  )).draw();

Exploring multivariate outliers

The final part of the video explores a range of ways to examine multivariate outliers, including scatterplots, parallel-coordinates plots and pairs plots, all of which can be reproduced in Python using the plotnine, plotly.express and seaborn libraries.

Unfortunately, there appear to be problems with plotnine's violin plots in this instance:

In [28]:
(ggplot(pngns_fixed, aes(x='bill_length_mm', y='species', colour='species')) + \
  geom_violin() + \
  geom_jitter(width=0, height=0.1) + \
  labs(
    title="Original pngns data",
    subtitle="Natural scale.",
    x="bill length (mm)",
    y="species"
)+ theme(legend_position = "none")).draw();
In [29]:
(ggplot(pngns_fixed, aes(x='bill_length_mm', y='bill_depth_mm')) + \
  geom_point() + \
  labs(
    title="Original pngns data",
    subtitle="Natural scale.",
    x="bill length (mm)",
    y="bill depth (mm)"
)).draw();
In [30]:
(ggplot(pngns_fixed, aes(x='bill_length_mm', y='bill_depth_mm',colour='species')) + \
  geom_point() + \
  labs(
    title="Original pngns data",
    subtitle="Natural scale.",
    x="bill length (mm)",
    y="bill depth (mm)"
) + theme(legend_position = "none") ).draw();
In [31]:
(ggplot(pngns_fixed, aes(x='bill_length_mm', y='bill_depth_mm', colour='species')) + \
  geom_point() + geom_smooth(method="lm") +\
  labs(
    title="Original pngns data",
    subtitle="Natural scale.",
    x="bill length (mm)",
    y="bill depth (mm)"
)  + theme(legend_position = "none") ).draw();
In [32]:
import plotly.express as px

fig = px.parallel_coordinates(
    pngns_fixed,
    dimensions=['bill_length_mm', 'bill_depth_mm', 'height'],
    color=pngns_fixed['species'].map({'Adelie': 1, 'Chinstrap': 2, 'Gentoo': 3}),
    labels={
        'bill_length_mm': 'Length', 
        'bill_depth_mm' : 'Depth',
        'height'        : 'Height'
    }
)
fig.show()
In [33]:
import seaborn as sns

g = sns.PairGrid(pngns_fixed, hue="species")
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
g.add_legend();

In conclusion

We hope this has given you a sense of the grammar of graphics concept and its implementation in plotnine. This is really meant as a taster, rather than a tutorial or deep dive. At this stage of your journey into data science, it is good for you to have awareness of the breadth and depth of visualisation methods at our disposal. It is, however, critical that you are aware of how pandas currently mistreats missing data so that you are able to avoid serious issues in analysing data with missing values.